Pediatric acute myeloid leukemia (pAML) encompasses over 20 molecular subtypes driven by unique genetic alterations, including hallmark chromosomal rearrangements and, less frequently, point mutations or tandem duplications. Collectively, many of these pAML molecular categories are enriched in pediatric populations and are not represented by current classification systems, including the recently updated WHO and ICC classifications. While fusion detection from RNA-Seq data is robust, many fusion-negative subtypes would need to be defined by expression-based approaches because mutation calling from RNA-Seq data is less developed. This can be challenging, however, for subtypes with similar transcriptional profiles, such as those with shared HOX expression patterns, including NPM1, NUP98r, UBTF, DEK::NUP214, and KMT2A-PTD. To aid in the appropriate molecular classification of pAML, which is crucial for prognosis, we developed and compared three gene expression-based classifiers.

A total of 1707 pAML gene expression profiles were mapped and analyzed from three distinct sources (St. Jude = 659; TARGET = 168; AAML1031 = 880). Raw read count data was normalized and scaled to obtain a relative expression value in transcripts per million (TPM), which served as input for feature selection, model training, and testing. Ground truth labels for all 1707 samples were obtained through multi-omics analysis, including whole genome sequencing, to identify fusions and mutations. For validation purposes, the data was stratified by subtypes and split 70/30 into training (n=1187) and testing (n=520).
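The normalization and split described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the abstract does not say how TPM was computed, so the use of per-gene lengths in base pairs is an assumption, and the function names `counts_to_tpm` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict


def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's raw read counts to transcripts per million.

    Assumption: each feature has a known length in base pairs.
    """
    # Reads per kilobase for each feature.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


def stratified_split(labels, test_fraction=0.30, seed=0):
    """Return (train_idx, test_idx), splitting each subtype separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    train, test = [], []
    for lab in sorted(by_label):
        idx = by_label[lab]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


tpm = counts_to_tpm([10, 20, 30], [1000, 2000, 3000])
train_idx, test_idx = stratified_split(["A"] * 10 + ["B"] * 20)
```

Splitting within each subtype keeps rare subtypes represented in both partitions, which a purely random 70/30 split does not guarantee.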

Three machine learning models were selected: random forest, XGBoost, and linear support vector machine (SVM). Each sample had gene expression TPM data for 60,754 transcripts, of which the 20,004 transcripts from protein-coding genes were carried forward to feature selection. Feature selection used the median absolute deviation (MAD) to rank transcripts by variability, and the 3000 most variable transcripts were retained to allow adequate tuning of the number of predictors. Each model was trained independently using stratified cross-validation with a Monte Carlo search for hyperparameter tuning. The best model was selected based on the Matthews correlation coefficient (MCC). Each model was then tested on the hold-out TPM set after z-score normalization.
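As a rough sketch of the feature-selection and scaling steps, the snippet below ranks features by MAD and applies z-score normalization. It assumes a samples-by-features matrix held as nested lists, and it assumes the z-score parameters are estimated on the training set and reused on the hold-out set; the abstract does not state which samples the scaling was fitted on. All function names are hypothetical.

```python
from statistics import mean, median, pstdev


def mad(values):
    """Median absolute deviation of one feature across samples."""
    m = median(values)
    return median([abs(v - m) for v in values])


def top_mad_features(matrix, k):
    """Return the column indices of the k features with the highest MAD."""
    n_features = len(matrix[0])
    scores = [mad([row[j] for row in matrix]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_zscore(train):
    """Estimate per-feature mean and standard deviation on training data."""
    cols = list(zip(*train))
    means = [mean(c) for c in cols]
    # Constant features get a unit scale to avoid division by zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    return means, sds


def apply_zscore(matrix, means, sds):
    """Scale any matrix with previously fitted parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in matrix]


train = [[1, 10, 5], [2, 20, 5], [3, 90, 5], [4, 40, 5]]
selected = top_mad_features(train, k=2)
means, sds = fit_zscore(train)
scaled = apply_zscore(train, means, sds)
```

In the study the equivalent of `k` would be 3000, applied to the 20,004 protein-coding transcripts. MAD is less sensitive to a few extreme samples than variance, which is one common reason to prefer it for ranking expression features.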

On the hold-out testing set (n=520), the linear SVM model outperformed the random forest and XGBoost models on five performance metrics across all subtypes (sensitivity=0.9577; precision=0.9577; specificity=0.9978; F1=0.96; accuracy=0.9958). The random forest (sensitivity=0.9154; precision=0.9154; specificity=0.9955; F1=0.92; accuracy=0.9915) and XGBoost (sensitivity=0.9231; precision=0.9231; specificity=0.9960; F1=0.92; accuracy=0.9923) models also performed well across all subtypes. Although feature selection was shared across all three models, performance within each subtype varied between models. The linear SVM model demonstrated strong performance overall, driven by high specificity in classifying the KMT2Ar subgroup (n=127) and equal sensitivity across the GLIS-rearranged (n=14), GATA1 (n=10), BCL11B (n=7), CBFB::MYH11 (n=60), CEBPA (n=33), and RUNX1::RUNX1T1 (n=78) subtypes.
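The abstract does not state how the metrics were averaged across subtypes. Identical sensitivity and precision values are what micro-averaging of one-vs-rest counts produces, so the sketch below assumes that scheme. The labels are synthetic: 520 samples, 20 classes, and 22 misclassifications are chosen only because they reproduce the linear SVM figures reported above, and the class count of 20 is an inference from those figures rather than a number given in the text.

```python
def micro_metrics(y_true, y_pred, classes):
    """Pool one-vs-rest TP/FP/FN/TN over all classes, then compute metrics."""
    tp = fp = fn = tn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Synthetic example (assumption): 20 classes, 520 samples, 22 errors.
n_classes = 20
y_true = [i % n_classes for i in range(520)]
y_pred = list(y_true)
for i in range(22):
    y_pred[i] = (y_true[i] + 1) % n_classes  # force a wrong prediction

m = micro_metrics(y_true, y_pred, range(n_classes))
rounded = {name: round(value, 4) for name, value in m.items()}
```

Under micro-averaging every misclassification counts once as a false positive and once as a false negative, so sensitivity, precision, and F1 all equal the fraction of correctly classified samples, while specificity and accuracy are inflated by the large pool of true negatives.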

The primary difference in performance between models is the high false positive rate for KMT2Ar and NPM1 (n=55) in the random forest and XGBoost models. A preliminary hypothesis is that this reflects the large representation of KMT2Ar and NPM1 in the training data (24.85% and 10.78%, respectively). Synthetic minority oversampling (SMOTE) of the training dataset (n=1369) counteracts bias towards the majority classes and increases performance for the random forest (sensitivity=0.9327; precision=0.9327; specificity=0.9965; F1=0.93; accuracy=0.9933), XGBoost (sensitivity=0.9404; precision=0.9404; specificity=0.9969; F1=0.94; accuracy=0.9940), and linear SVM (sensitivity=0.9615; precision=0.9615; specificity=0.9980; F1=0.96; accuracy=0.9962) models.
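The core idea of SMOTE can be illustrated in a few lines: a synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest same-class neighbours. The sketch below is a simplified, stdlib-only illustration rather than the implementation used in the study; the function name `smote_class`, the neighbour count `k`, and the number of synthetic samples are all assumptions.

```python
import math
import random


def smote_class(samples, n_new, k=5, seed=0):
    """Create n_new synthetic points for one class by neighbour interpolation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # The k nearest same-class neighbours of the chosen sample.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_class(minority, n_new=4, k=2)
```

Oversampling of this kind should be applied to the training partition only, so that the hold-out set still contains exclusively real samples.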

Together, these models demonstrate the utility and effectiveness of a machine learning approach for classifying pAML samples from transcriptome sequencing data, which may have broad clinical and research utility, especially for fusion-negative subtypes.

No relevant conflicts of interest to declare.
